OcrV1, Main, Exploration, bibRecord, 001D78

Block addressing indices for approximate text retrieval

Identifieur interne : 001D78 ( Main/Exploration ); précédent : 001D77; suivant : 001D79

Block addressing indices for approximate text retrieval

Auteurs : Ricardo Baeza Ates [Chili] ; Gonzalo Navarro [Chili]

Source :

Journal of the American Society for Information Science [ 0002-8231 ] ; 2000.

RBID : ISTEX:B2935122C6D520FDE322924057BFFA8D4F266930

Abstract

The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limited number of differences between the words found and those sought. This is called “approximate text searching,” which is becoming more and more popular. In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these “block addressing” indices, and even less is known when it comes to cope with approximate search. Our main contribution is an analytical study of the space‐time trade‐offs for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance figures. These results are valid for classical exact queries as well as for approximate searching. We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search times.

Url:

https://api.istex.fr/document/B2935122C6D520FDE322924057BFFA8D4F266930/fulltext/pdf

DOI: 10.1002/(SICI)1097-4571(2000)51:1<69::AID-ASI10>3.0.CO;2-C

Affiliations:

Chili

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 001978
to stream Istex, to step Curation: 001876
to stream Istex, to step Checkpoint: 001381
to stream Main, to step Merge: 001E78
to stream Main, to step Curation: 001D78

Le document en format XML

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Block addressing indices for approximate text retrieval</title>
<author><name sortKey="Baeza Ates, Ricardo" sort="Baeza Ates, Ricardo" uniqKey="Baeza Ates R" first="Ricardo" last="Baeza Ates">Ricardo Baeza Ates</name>
</author>
<author><name sortKey="Navarro, Gonzalo" sort="Navarro, Gonzalo" uniqKey="Navarro G" first="Gonzalo" last="Navarro">Gonzalo Navarro</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B2935122C6D520FDE322924057BFFA8D4F266930</idno>
<date when="2000" year="2000">2000</date>
<idno type="doi">10.1002/(SICI)1097-4571(2000)51:1<69::AID-ASI10>3.0.CO;2-C</idno>
<idno type="url">https://api.istex.fr/document/B2935122C6D520FDE322924057BFFA8D4F266930/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001978</idno>
<idno type="wicri:Area/Istex/Curation">001876</idno>
<idno type="wicri:Area/Istex/Checkpoint">001381</idno>
<idno type="wicri:doubleKey">0002-8231:2000:Baeza Ates R:block:addressing:indices</idno>
<idno type="wicri:Area/Main/Merge">001E78</idno>
<idno type="wicri:Area/Main/Curation">001D78</idno>
<idno type="wicri:Area/Main/Exploration">001D78</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Block addressing indices for approximate text retrieval</title>
<author><name sortKey="Baeza Ates, Ricardo" sort="Baeza Ates, Ricardo" uniqKey="Baeza Ates R" first="Ricardo" last="Baeza Ates">Ricardo Baeza Ates</name>
<affiliation wicri:level="1"><country xml:lang="fr">Chili</country>
<wicri:regionArea>Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago</wicri:regionArea>
<wicri:noRegion>Santiago</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Chili</country>
</affiliation>
</author>
<author><name sortKey="Navarro, Gonzalo" sort="Navarro, Gonzalo" uniqKey="Navarro G" first="Gonzalo" last="Navarro">Gonzalo Navarro</name>
<affiliation wicri:level="1"><country xml:lang="fr">Chili</country>
<wicri:regionArea>Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago</wicri:regionArea>
<wicri:noRegion>Santiago</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Chili</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of the American Society for Information Science</title>
<title level="j" type="abbrev">J. Am. Soc. Inf. Sci.</title>
<idno type="ISSN">0002-8231</idno>
<idno type="eISSN">1097-4571</idno>
<imprint><publisher>John Wiley & Sons, Inc.</publisher>
<pubPlace>New York</pubPlace>
<date type="published" when="2000">2000</date>
<biblScope unit="volume">51</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="69">69</biblScope>
<biblScope unit="page" to="82">82</biblScope>
</imprint>
<idno type="ISSN">0002-8231</idno>
</series>
<idno type="istex">B2935122C6D520FDE322924057BFFA8D4F266930</idno>
<idno type="DOI">10.1002/(SICI)1097-4571(2000)51:1<69::AID-ASI10>3.0.CO;2-C</idno>
<idno type="ArticleID">ASI10</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0002-8231</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limited number of differences between the words found and those sought. This is called “approximate text searching,” which is becoming more and more popular. In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these “block addressing” indices, and even less is known when it comes to cope with approximate search. Our main contribution is an analytical study of the space‐time trade‐offs for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance figures. These results are valid for classical exact queries as well as for approximate searching. We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search times.</div>
</front>
</TEI>
<affiliations><list><country><li>Chili</li>
</country>
</list>
<tree><country name="Chili"><noRegion><name sortKey="Baeza Ates, Ricardo" sort="Baeza Ates, Ricardo" uniqKey="Baeza Ates R" first="Ricardo" last="Baeza Ates">Ricardo Baeza Ates</name>
</noRegion>
<name sortKey="Baeza Ates, Ricardo" sort="Baeza Ates, Ricardo" uniqKey="Baeza Ates R" first="Ricardo" last="Baeza Ates">Ricardo Baeza Ates</name>
<name sortKey="Navarro, Gonzalo" sort="Navarro, Gonzalo" uniqKey="Navarro G" first="Gonzalo" last="Navarro">Gonzalo Navarro</name>
<name sortKey="Navarro, Gonzalo" sort="Navarro, Gonzalo" uniqKey="Navarro G" first="Gonzalo" last="Navarro">Gonzalo Navarro</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001D78 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001D78 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:B2935122C6D520FDE322924057BFFA8D4F266930
   |texte=   Block addressing indices for approximate text retrieval
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	Serveur d'exploration sur l'OCR
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration sur l'OCR

Block addressing indices for approximate text retrieval

Block addressing indices for approximate text retrieval

Source :

Abstract

Links toward previous steps (curation, corpus...)

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri